179 research outputs found

Knowledge Organization Research in the last two decades: 1988-2008

Author: Ibekwe-Sanjuan Fidelia
Sanjuan Eric
Publication date: 28/02/2010

We apply an automatic topic mapping system to records of publications in knowledge organization (KO) published between 1988 and 2008. The data was collected from journals publishing articles in the KO field, as indexed in the Web of Science (WoS) database. The results showed that while topics in the first decade (1988-1997) were more traditional, the second decade (1998-2008) was marked by a more technological orientation and by the appearance of more specialized topics driven by the pervasiveness of the Web environment.

arXiv.org e-Print Archive

HAL

HAL-Lyon 3

The landscape of Information Science: 1996-2008

Author: Ibekwe-Sanjuan Fidelia
Sanjuan Eric
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/06/2009

We propose a methodology combining symbolic and numeric information to map the structure of research in Information Science between 1996 and 2008. The visualization of the resulting maps showed that while the two-camp structure of Information Science observed in previous studies is still valid, other research poles such as web and user-oriented studies are building bridges between the two hitherto isolated poles.

HAL

HAL-Lyon 3

Combining Language Models with NLP and Interactive Query Expansion.

Author: Ibekwe-Sanjuan Fidelia
Sanjuan Eric
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 07/12/2009

Following our previous participation in the INEX 2008 Ad-hoc track, we continue to address both standard and focused retrieval tasks based on comprehensible language models and interactive query expansion (IQE). Query topics are expanded using an initial set of Multiword Terms (MWTs) selected from the top n ranked documents. In this experiment, we extract MWTs from article titles, narrative fields and automatically generated summaries. We combined the initial set of MWTs obtained in an IQE process with automatic query expansion (AQE) using language models and a smoothing mechanism. We chose as baseline the Indri IR engine, based on a language model using Dirichlet smoothing. We also compare the performance of bag-of-words approaches (TFIDF and BM25) to search strategies elaborated using language models and query expansion (QE). The experiment is carried out on all INEX 2009 Ad-hoc tasks.
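For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of query-likelihood scoring with Dirichlet smoothing, where the smoothed term probability is (count of the term in the document + mu × collection probability of the term) / (document length + mu). It is not Indri's implementation and does not reproduce the authors' settings: the function name `dirichlet_score`, the default `mu`, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection, mu=2000.0):
    """Log query likelihood of `query` under a Dirichlet-smoothed model of `doc`.

    query, doc, collection: lists of tokens (collection = all documents concatenated).
    mu: smoothing parameter (assumed default; must be > 0 for unseen terms).
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    coll_len = len(collection)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len
        if p_coll == 0.0:
            continue  # term absent from the whole collection: ignored in this sketch
        # Smoothed estimate: interpolates document counts with collection statistics.
        p_term = (doc_tf[term] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p_term)
    return score


# Hypothetical toy data, for illustration only.
d1 = ["language", "model", "retrieval"]
d2 = ["graph", "clustering", "terms"]
collection = d1 + d2
query = ["language", "model"]

ranked = sorted(
    [("d1", d1), ("d2", d2)],
    key=lambda item: dirichlet_score(query, item[1], collection),
    reverse=True,
)
```

With a large `mu` the document model stays close to the collection model, so scores of different documents are near each other; with a small `mu` the document's own term counts dominate.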

HAL

HAL-Lyon 3

Text mining without document context

Author: Ibekwe-SanJuan Fidelia
SanJuan Eric
Publication venue: 'Elsevier BV'
Publication date: 01/01/2006

We consider a challenging clustering task: the clustering of multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology taking as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning (k-means, k-medoids) algorithms. This out-of-context clustering task led us to adapt multi-word term representation for statistical methods and also to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field which comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.

Annotation of Scientific Summaries for Information Retrieval.

Author: Eric Charton
Ibekwe-Sanjuan Fidelia
Sanjuan Eric
Silvia Fernandez
Publication venue: HAL CCSD
Publication date: 30/03/2008

We present a methodology combining surface NLP and Machine Learning techniques for ranking abstracts and generating summaries based on annotated corpora. The corpora were annotated with meta-semantic tags indicating the category of information a sentence bears (objective, findings, newthing, hypothesis, conclusion, future work, related work). The annotated corpus is fed into an automatic summarizer for query-oriented abstract ranking and multi-abstract summarization. To adapt the summarizer to these two tasks, two novel weighting functions were devised to take into account the distribution of the tags in the corpus. Results, although still preliminary, encourage us to pursue this line of work and to find better ways of building IR systems that can take into account semantic annotations in a corpus.

HAL

HAL-Lyon 3

Decomposition of terminology graphs for domain knowledge acquisition.

Author: Ibekwe-Sanjuan Fidelia
Sanjuan Eric
Vogeley Michael
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 26/10/2008

We propose a graph decomposition algorithm for analyzing the structure of complex graph networks. After multi-word term extraction, we apply techniques from text mining and visual analytics in a novel way by integrating symbolic and numeric information to build clusters of domain topics. Terms are clustered based on surface linguistic variations, and clusters are inserted in an association network based on their intersection with documents. The graph is then decomposed, based on its atom graph structure, into a central (non-decomposable) atom and peripheral atoms. The whole process is applied to publications from the Sloan Digital Sky Survey (SDSS) project in the Astronomy field. The mapping obtained was evaluated by a domain expert and appeared to have captured interesting conceptual relations between different domain topics.

HAL

HAL-Lyon 3

SDOC and TermWatch: two complementary topic-mapping methods (original title: SDOC et TermWatch : deux méthodes complémentaires de cartographie de thèmes)

Author: Ibekwe-Sanjuan Fidelia
Polanco Xavier
Sanjuan Eric
Publication venue: HAL CCSD
Publication date: 01/01/2004

The aim of this paper is to compare two methods, originally designed for scientific and technical watch, in a text mining application. Both methods let the user visualize the results of an unsupervised hierarchical clustering of textual data as a thematic map. They are nonetheless complementary: SDOC is based on analysis of the co-occurrence matrix and positions the classes (clusters) on the plane according to their structural properties, whereas TermWatch clusters terms solely on the basis of their syntactic variation links and presents the results as a network that can be visualized with the AiSee software, in which the links are tighter the closer the classes are assumed to be thematically.

Identifying Thematic Variations in SDSS research.

Author: Chen Chaomei
Ibekwe-Sanjuan Fidelia
Sanjuan Eric
Vogeley Michael
Publication venue: Presses Universitaires de Lyon
Publication date: 12/03/2008

The Sloan Digital Sky Survey (SDSS) is the largest ongoing sky survey. It regularly makes data releases to the astronomical community. From a macroscopic point of view, a profound question is: what is the role of SDSS data releases in the evolution of the relevant scientific fields? In this paper, we introduce an integrated approach combining statistical, information-theoretical, and symbolic methods for text data analysis, and show how this combined approach can distinguish thematic variations associated with the different data releases.

HAL

HAL-Lyon 3